RNA Sequencing Workshop

Tom Hardiman Cancer Bioinformatics Group King’s College London

October 2018

Introduction

In this workshop we will be analysing RNA sequencing data, starting with the raw input files and finishing with a list of differentially expressed genes.

The data used in this workshop is from real samples however, a subset of the sequencing reads have been used. This allows for the pipeline to be run in its entirety in the time available in this workshop. Running this pipeline on a full set of samples will typically take 48 – 72 hours to complete, depending on the number of reads sequenced and the computer processing power available.


Experimental setup

For this workshop we will be working with a total of four RNA sequencing samples collected from primary breast tumours. Two of these samples are HER2 positive with the remaining two being HER2 negative. HER2-positive breast cancer is a breast cancer that tests positive for a protein called human epidermal growth factor receptor 2 (HER2), which promotes the growth of cancer cells. These samples are from the same patients which were used in the copy number tutorial previously. More information about the four available samples is shown in the table below. To reduce the processing time during this workshop a subset of 5,000,000 reads have been taken at random from each of the samples (approximately 5% of a sample’s reads).


Sample Name HER2 status
sample_1 negative
sample_2 negative
sample_3 positive
sample_4 positive


Workflow overview

The steps performed in the workshop follow the typical workflow used to analyse RNA sequencing data.

We will start by assessing the initial quality of the raw sequencing data, performing any trimming or filtering if necessary and then re-assessing this data. After the raw data has been cleaned we will be using two methods to quantify the expression of genes. Finally, we will carry out differential gene expression analysis and investigate the differences between the two methods. An overview of the workflow being followed is shown below.



Tools

To facilitate this analysis, we will be using several tools (shown below), At several stages in the workflow we will be using MulitQC, this is an extremely useful tool to create a summary report for all samples together for a whole range of bioinformatic tools.



Detailed tool list




Workshop setup

First we need to setup the enviroment to analyse the data in the workshop. This analysis will be performed in an interactive session, which is setup as followed:

# Setting master paths for the whole workshop

output_dir_master=~/RNA_Seq_workshop/output
raw_rna_seq_folder=~/RNA_Seq_workshop/raw_reads
resources=~/RNA_Seq_workshop/resources




Initial quality control



It is essential some initial quality control assessments are performed on RNA sequencing FASTQ files when they are received from a sequencing facility. This should include assessment of the GC percentage distribution (proportion of guanine and cytosine base pairs across the reads), assessing the overall read quality and investigate overrepresented sequences.

To perform initial quality assessment of RNA sequencing data, the FastQC tool is used. This tool is simply run by using the command below:

# Loading FastQC
module load fastqc/0.11.8 

# Setting and creating output folder
outputdir=$output_dir_master/fastqc/initial
mkdir -p $outputdir

# Loop to run FastQC
cd $raw_rna_seq_folder
for dir in */ ; do # Outer loop around sample directories
        cd "$raw_rna_seq_folder/$dir"
        for file in * ;do # inner loop around forward and reverse reads for each sample
             filepath="$raw_rna_seq_folder/$dir$file"
             echo $filepath
             # FastQC comand - exlained in detail in section below
              fastqc $filepath \
                --outdir=$outputdir

        done
done
  • $fastq_file - Path to the .fasta (or .fa) file to be assessed

  • --outdir=$path_for_report - The path to where the report should be outputted to

This tool cannot handle paired end samples together so forward and reverse reads must be assessed separately, with a separate .html report for each. These reports can be viewed by navigating to the output folder and using firefox. e.g firefox $output_dir_master/fastqc/initial/sample_1_1_fastqc.html


Understanding the report

The following sections are addressed in the report:

Basic statistics

Much of the statistics shown in this section are straightforward and easy to understand. When working with paired end samples, it is important to check that both the forward and reverse read files contain the same number of reads. The percentage of GC bases is also calculated and this should be checked to make sure that it is within the normal range for the species being investigated.

Per base sequence quality

The Phred scale quality represents the probability p that the base call is incorrect. A Phred score Q is an integer mapping of p where Q = -10 log10 p. Briefly, a Phred score of 10 corresponds to one error in every 10 base calls or 90% accuracy; a Phred score of 20 to one error in every 100 base calls or 99% accuracy. The maximum Phred score in any sample is 40.

Overall, the quality of reads at each of the bases across all the samples looks to be good. It is a good idea however to still investigate the individual FastQC reports, as these give a range for phred scores, instead of only the average.

Looking at each sample’s individual plot indicates that some reads have lower quality bases at the begining and end of their sequence. Although not strictly necessary, we will apply a trimming tool to remove these low quality bases.

GC distribution

The GC distribution across reads is expected to approximately follow a theoretical normal distribution. If the curve presents a shoulder in a region of high GC content, this can indicate that rRNA is present in the sample. However, it may also represent contamination by an organism with a higher GC content. In contrast, a peak on the left hand side would indicate enrichment for A/T rich sequences. In particular a sharp peak for very low GC content (in the 0-3 range) is usually indicative of the sequencing of the mRNA poly-A tails. If this plot still shows issues after quality and rRNA filtering, additional steps would have to be taken to filter contaminants.

The GC distribution from all samples shows the desired normal distribution. Sample 4 has been flagged with a warning by FastQC, however a normal distribution is still obserbed in this sample and the distribution is still close to the other three samples meaning this is not a cause for concern. In addition, all other quality metrics do not indicate any issues with this sample.

Overrepresented sequences

Overrepresented sequences have been identified in four of eight .fastq files, with all of these sequences identified as Illumina adapters. These adapters sequences will be removed using the trimming tool

Additional plots

FastQC produces several additional QC plots, these plots are not informative for RNA sequencing data and will not be used in assessing the initial quality of the data.




Trimming



Based on the initial QC of the data, we will be performing quality and adapter sequence trimming. In this process, bases of reads which are low quality or match adapter sequences will be dropped, with the whole read being removed if large proportions of it match this criteria.

Adapter sequences depend on the library kit used during the sequencing, so it is important to check you are using the correct adapters during trimming. There are several tools designed to perform trimming, in this workshop we will be utilising Trimmomatic.

# Setup paths for files
tmtic_jar_file=/opt/apps/Trimmomatic-0.38/trimmomatic-0.38.jar
adapters=$resources/adapters/adapter_sequences.fa


# Loop to run Trimmomatic

cd $raw_rna_seq_folder

for dir in */ ; do #Looping around sample directories
      # Make output folder for each sample
        outputdir=$output_dir_master/trimmomatic/$dir
        mkdir -p $outputdir
        
      cd "$raw_rna_seq_folder/$dir"
      
      # Input files
      forward_fastq_input_file="$raw_rna_seq_folder/$dir"$(ls | grep _1.fq)
      reverse_fastq_input_file="$raw_rna_seq_folder/$dir"$(ls | grep _2.fq)
      # Sample name
      sample_name=$(ls | grep _1.fq)
      sample_name=${sample_name%?????} #remove file extention and "_1" to leave just sample name
      
      # Output files
      output_forward_paired_fastq_file="$outputdir$sample_name"_1_trimmed_paired.fq
      output_reverse_paired_fastq_file="$outputdir$sample_name"_1_trimmed_unpaired.fq
      output_forward_unpaired_fastq_file="$outputdir$sample_name"_2_trimmed_paired.fq
      output_reverse_unpaired_fastq_file="$outputdir$sample_name"_2_trimmed_unpaired.fq
      
      # Trimmomatic comand - exlained in detail in section below
        java -jar -Xmx20G $tmtic_jar_file PE -phred33 \
                        $forward_fastq_input_file $reverse_fastq_input_file \
                        $output_forward_paired_fastq_file $output_reverse_paired_fastq_file \
                        $output_forward_unpaired_fastq_file $output_reverse_unpaired_fastq_file \
                        ILLUMINACLIP:$adapters:2:30:10 \
                        LEADING:30 TRAILING:30 MINLEN:36
done
  • java -jar - Calling Java to run Trimmomatic

  • -Xmx20G - Set memory to 20 GB

  • $tmtic_jar_file - Path to Trimmomatic’s .jar file

  • PE - Setting Trimmomatic to paired end sequencing mode

  • -phred33 - Setting quality score encoding to Phred33

  • $forward_fastq_input_file $reverse_fastq_input_file - Paths to the raw forward and reverse fastq files.

  • $output_forward_paired_fastq_file $output_reverse_paired_fastq_file - The file path and name of the paired output from the trimming process - these are the files which will be used in downstream analysis

  • $output_forward_unpaired_fastq_file $output_reverse_unpaired_fastq_file - The file path and name of the unpaired output from the trimming process - these files contain reads where its corresponding partner was removed and thus are excluded from downstream analysis

  • ILLUMINACLIP:$adapter_file - Cut adapter and other illumina-specific sequences from the read

  • LEADING:30 - Cut bases off the start of a read, if below a threshold quality of 30

  • TRAILING:30 - Cut bases off the end of a read, if below a threshold quality of 30

  • MINLEN:36 - Drop the read if it is below a specified length


The results from Trimmomatic (shown below) show that over 95% of paired reads in each sample remain after trimming. This agrees with our initial assessment that the data was of good quality.

(As you are running your jobs in an interactive environment, the logs required by MultiQC are not made meaning you cannot make this plot for yourself)




Additional filtering & quality assessment steps

In addition to the trimming performed in this workshop there are other filtering steps which can be performed, depending on the data being analysed.

Ribosomal filtering

If a ribosomal RNA depletion kit was used during the library prep of the sequencing data, it can be beneficial to perform a ribosomal filtering step. Ribosomal depletion kits are never 100% effective meaning that some reads originating from ribosomal RNA will remain. These remaining reads can be filtered using the SortMeRNA tool.

Based on our initial assessment of the data there was no indication that any ribosomal contamination was present in the samples. For completeness, SortMeRNA has been run on the data for you with the results summarized in the plot below. These results show that a maximum of 1.3% reads were identified as being of ribosomal in origin (counts must be divided by 2 as the sample is paired end). Additionally a small amount of bacterial contamination is potentially identified in sample 2 (black and pink band) however any potential contamination is at extremely low levels (0.019% of reads) and is not of concern.

Identifying contamination

During the library preparation and sequencing process, it is possible for a sample to become contaminated with a foreign species’ RNA. Having the ability to identify potential contamination is an important step in assessing the quality of sequencing data. This can be achieved by using the FastQ Screen tool.

There is no indication that contamination is an issue in these samples but again for completeness the tool has been run for you. Mouse, Rat and E Coli were investigated as potential contamination sources. The results show that no unique reads were mapped to any of these species (blue bars) however, some reads aligned to multiple species and this is to be expected. A small portion of reads were not signed to any of the supplied species genome. This could be indication of contamination by an unknown species, however at these low levels it is more likely to be human reads which the tool failed to align to the genome. This failure may be due to the fact that the underlying aligner for this tool is not splice aware and it not best suited for handling RNA sequencing data.




Post trimming / filtering quality control



Now that we have completed our cleaning of the data, we re-assess it using FastQC. This is achived using the command below.

# Setting and creating output folder
outputdir=$output_dir_master/fastqc/post_trim
mkdir -p $outputdir

# Loop to run FastQC
trim_output=$output_dir_master/trimmomatic
cd $trim_output
for dir in */ ; do # Outer loop around sample directories
        cd "$trim_output/$dir"
        for file in *trimmed_paired.fq ;do # inner loop around forward and reverse reads for each sample
             filepath="$trim_output/$dir$file"
             echo $filepath
             # FastQC comand - exlained in detail in section below
              fastqc $filepath \
                --outdir=$outputdir

        done
done
  • $fastq_file - Path to paired output file from trimmomatic - forward and reverse files must be assessed separately

  • --outdir=$path_for_report - The path to where the report should be outputted to


These reports can be viewed by navigating to the output folder and using firefox, for example: firefox $output_dir_master/fastqc/post_trim/sample_1_1_fastqc.html

We will now look at the quality metrics we wanted to improve by performing trimming.

Per base sequence quality

The individual plots for each sample show that the trimming has been successful in removing the poor quality portions at the begining and end of the reads.

Overrepresented sequences

Looking at the overrepresented sequences section of the report shows there is no longer any adapter contamination, again confirming that the trimming was successful.




Alignment methods

There are two methods which can be used to align reads to a reference sequence. The first method (conventional alignment) uses a reference genome and maps reads to this. The second (pseudo-alignment) uses a reference transcriptome, with reads being matched to this. Both methods are discussed in more detail below.

Conventional alignment



There are many tools available for aligning reads to a reference genome. In this workshop we will be using HISAT2. The main advantage of this tool is the speed at which it can map reads & its low memory requirements.


Creating a genome index

Before reads can be mapped, an index of the reference genome must be made. This indexing step allows for reads to be mapped at a much faster speed and is integral to the operation of HISAT2

For this workshop, a reference genome for homo sapiens was obtained from Ensembl and an index of the genome has been pre-made for you. This index is located in the ~/resources/genome/hisat2_index/ folder. The code used to produce this index is provided in the appendix of this workshop.


Mapping reads

Using the pre-made index, the trimmed reads can be mapped to the reference genome with the code below.

module load samtools/1.8-gcc
module load hisat2/2.1.0 

trim_output=$output_dir_master/trimmomatic


# Path to genome index
genome_index=$resources/genome/hisat2_index/human

cd $trim_output
for dir in */ ; do # loop around sample directories
        # Make output dir
        outputdir=$output_dir_master/HISAT2/$dir
        mkdir -p $outputdir
        
        # Create paths for outfile from HISAT2 and samtools
        outfile_HISAT2=$outputdir${dir%?}.sam
        outfile_samtools=$outputdir${dir%?}_sorted.bam
      
        cd "$trim_output/$dir"
        forward="$trim_output/$dir"$(ls | grep 1_trimmed_paired.fq)
        reverse="$trim_output/$dir"$(ls | grep 2_trimmed_paired.fq)
        
        echo $outputdir
        # HISAT2 comand - exlained in detail in section below
        hisat2 -x $genome_index \
                -1 $forward \
                -2 $reverse \
                -S $outfile_HISAT2
      
      # Convert .sam produced by HISAT2 to a sorted .bam file using SAMtools
      
        samtools view -bh $outfile_HISAT2 | samtools sort -o $outfile_samtools

done

HISAT2

  • hisat2 - Calling HISAT2

  • -x $genome_index - Path to the genome index folder (including basename) - this has been provided for you in this workshop

  • -1 $forward - Path to forward trimmed paired .fa files

  • -2 $reverse - Path to reverse trimmed paired .fa files

  • -S $outfile_HISAT2 - Path (including file name) to the output .sam file created by HISAT2


The output of this process will be a .sam file, these Sequence Alignment Map files take up much more storage space than their binary equivalent (.bam). To convert these .sam files to .bam files we use SAMtools

SAMtools

  • samtools view -bh $outfile_HISAT2 - Convert .sam file to a .bam file.

  • samtools sort -o $outfile_samtools - Sort the .bam based on genomic cordinates


HISAT2 sucessfully aligns at least 97% of reads for each sample to the refence genome, with at least 88% of these being aligned only once.

Pseudo-alignment



Pseudo-alignment has the benefit over conventional aligners in that it can better correct for bias in RNA sequencing data. There are some drawbacks to this method, mainly the transcriptome of the species must be well documented, this limits it use to mainly model organisms. As no actual aligning is performed, it is also not possible to extract positional information, in addition reads containing many mutations will not be mapped correctly.

One of the additional advantages to using pseudo-aligners is that they can quantify reads much faster and with less computational power than conventional aligners. When working with extremely small datasets, like the one in this tutorial, the opposite is observed. During the pseudo-alignment process, reading in the index as well as pre-processing takes several minutes regardless of the sample size, this can result in very small datasets taking longer to process with pseudo-aligners. Due to this, we will not be running Salmon in this workshop. The output from Salmon has been provided for you, as well as the code used to generate them.

There are two main tools for performing pseudo-alignment to map reads to the transcriptome - Salmon & Kallisto. In this workshop we will be demonstrating Salmon, there is no real advantage of one of these over the other.


Creating a transcriptome index

Much like HISAT2, the first step is to create an index. The main difference however is that this index is based on a transcriptome instead of the genome. The transcriptome reference is also obtained from Ensembl. The code used to produce this index is provided in the appendix of this workshop.


Quantifying transcripts

Now that the index of the transcriptome has been made, transcripts can now be quantified from the trimmed reads. The code used to run Salmon is provided in the appendix of this workshop.

The output of this process will be transcript level quantification which can be directly used for downstream analysis such as differential gene expression. Further steps are required however get to the same stage with the conventionally aligned data.

The pre-processed Salmon results are can be found in ~/resources/salmon. Salmon sucessfully aligned a miniumn of 89.6% reads for each sample.




Post alignment quality control



To assess the quality of the alignment produced by HISAT2, the Qualimap tool is used.

module load qualimap/2.2.1

HISAT2_output=$output_dir_master/HISAT2

cd $HISAT2_output
for dir in */ ; do # loop around sample directories
        # Make output dir
        outputdir=$output_dir_master/qualimap/$dir
        mkdir -p $outputdir
        sample_name=${dir%?}_rna_report
        sorted_bam_from_HISAT2="$HISAT2_output"/"$dir"${dir%?}_sorted.bam
        
        echo $outputdir
        qualimap rnaseq -bam $sorted_bam_from_HISAT2 \
            -gtf $resources/genome/GTF_file/Homo_sapiens.GRCh38.92.gtf -outdir $outputdir \
            -p strand-specific-reverse -pe  \
            -outfile $sample_name -outformat HTML \
            --java-mem-size=20G

done
  • rnaseq - Sets Qualimap to rna-seq mode

  • -bam $sorted_bam_from_HISAT2 - Path to the sorted .bam file generated by HISAT2 & SAMtools

  • -gtf $gtf_file - Path to the .gtf file (Gene Transfer Format) containing annotations for the .fa used for generation of HISAT2 index

  • -p strand-specific-reverse - The strand-specificity of the sequencing library (if this is unknown then this can be obtained from the Salmon log)

  • -pe - Turning on paired end mode

  • -outfile $outputdir - The base name for the report that will be generated

  • -outformat HTML - Generate an .html report (.pdf also available)

  • --java-mem-size=5G - Setting the memory available for qualimap (this memory must also be available on the server node)

When running this code, a warning message will be displayed that chromosomes from reads are not found in the annotations. This warning should be ignored as it as result of a quirk in the way HISAT2 creates its output which triggers the warning in quailmap.

The genetic origin plot looks as expected, with a majority of reads are classed as exonic. Reads called as intergenic could be a result of reads which had not yet been spliced when RNA was extracted, reads which overlap the exon boundries or reads of unknown spliced transcripts. Intergenic reads are likely non-coding RNA

The coverage profile again shows no issues with the data or the alignment produced by HISAT2. Extreme low coverage at the begining of reads can be an indication of poor RNA quailty. Assessing coverage is particularly important when working with RNA extracted from FFPE samples. More information on this can be found in this paper.




Transcript quantification



To produce gene or transcript level counts from the HISAT2 aligned data FeatureCounts software package is used. In this workshop we are not interested in looking at transcript level, so will be be using gene level summarization.

HISAT2_output=$output_dir_master/HISAT2


featureCounts=/opt/apps/subread-1.6.2/bin/featureCounts

mkdir -p $output_dir_master/featureCounts/

outfile=$output_dir_master/featureCounts/counts.txt

cd $HISAT2_output

declare -a files

for dir in */ ; do # loop around sample directories
    cd "$HISAT2_output/$dir"
    filename=${dir%?}
    fullpath="$HISAT2_output/$dir""$filename"_sorted.bam
    files+=($fullpath)
done

$featureCounts ${files[*]} -a $resources/genome/GTF_file/Homo_sapiens.GRCh38.92.gtf -O --fraction -M -o $outfile  -p -S 2 -T 1

#to clear 'files' variable: files=()
  • ${files[*]} - Path to sorted .bam file generated by HISAT2 & SAMtools for all samples

  • -a $gtf_file - Path to the .gtf file (Gene Transfer Format) containing annotations for the .fa used for generation of HISAT2 index

  • -O - Allows reads to be assigned to more than one feature in the .gtf file

  • --fraction - Assign fractional counts to features (more infomation can be found in the software’s user manual)

  • -M - multi-mapping reads will be counted

  • -o $outfile - The name and path of the output file

  • -p - Sets to paired end mode

  • -S 2 - The strand-specificity of the sequencing library (if this is unknown then this can be obtained from the Salmon log), number codes are: 0 (unstranded), 1 (stranded) and 2 (reversely stranded)

  • -T 1 - Sets the number of threads to be used (in this case 1), this number has to be the number of available cores on the server node

FeatureCounts successfully assigns at least 86% of the aligned reads to genes




Differential gene expression



The gene / transcript level quantification from each of the alignment methods will be assessed independently to investigate the differences between the two methods. Both sets of data will be subjected to the same analysis.

The differential gene expression analysis will be performed in R using the DESeq2 package.

Conventional alignment

First we will look at the counts obtained using HISAT2

Modifying counts file

Prior to reading the data into R, we will make a few formatting changes to the count file created by FeatureCounts. This will remove the command line header, remove some unwanted columns and remove the path from the sample names.

tail -n +2 $output_dir_master/featureCounts/counts.txt  | awk '{print $1"\t"$7"\t"$8"\t"$9"\t"$10}' | sed '1c\Geneid\tsample_1\tsample_2\tsample_3\tsample_4' > $output_dir_master/featureCounts/counts_modified.txt
  • tail -n +2 counts.txt - Remove top command line from the file

  • awk '{$2=$3=$4=$5=$6=""; print $0}' - Remove columns 2-6 from the file (these contain extra information about the gene identified which is not needed for DE)

  • sed '1c\Geneid\tSample1\tSample2\tSample3\tSample4' - Rename the the columns to remove the full file path

  • > $output_dir_master/featureCounts/counts_modified.txt - Write changes to new text file


Setting up R and reading in data

From this point forward, all code is written in R and should be run using the Rstudio server.

To connect to the Rstudio server open your web browser and navigate to http://10.100.75.81:8787 - you will need to login using your same username and password

#Loading R packages
library(DESeq2)

# Reading in data
count_table <- read.table("~/RNA_Seq_workshop/output/featureCounts/counts_modified.txt",sep = "\t",header = T,row.names = 1)

# Create sample table 
samples <- c("sample_1","sample_2","sample_3","sample_4")
status <- (c(rep("negative",2),rep("positive",2)))
sample_table <- data.frame(samples,status)
colnames(sample_table) <- c("sample","HER2_status")

After running this code you should have two tables stored in R, head() of these tables is shown below:

head(count_table)
sample_1 sample_2 sample_3 sample_4
ENSG00000223972 0.50 1.30 0.40 1.96
ENSG00000227232 39.11 63.82 23.57 129.15
ENSG00000278267 0.32 0.27 0.00 1.94
ENSG00000243485 0.00 0.60 0.00 0.00
ENSG00000284332 0.00 0.00 0.00 0.00
ENSG00000237613 0.00 0.00 0.00 0.00
head(sample_table)
sample HER2_status
sample_1 negative
sample_2 negative
sample_3 positive
sample_4 positive

Now that the data has been read into R, we can commence with setting up DESeq2


Setting up DESeq2

Before performing differential expression, we must first relevel the data. This re-leveling allows us to set one of our variables (either positive or negative HER2 status) as your reference level. This will mean that the fold change calculated during differential expression analysis will be in relation to this level. We will be setting negative HER2 status as our reference level, meaning that genes with a positive fold change will be upregulated in HER2 positive samples, whereas genes with a negative fold change will be down regulated in HER2 positive samples.

sample_table$HER2_status <- relevel(sample_table$HER2_status, ref = "negative")

Next we convert our data.frame object to a matrix (this is needed to pass the data to a DESeq DataSet) and in the process round our values. We will create our DESeq DataSet using the DESeqDataSetFromMatrix fucntion, providing it with our counts matrix, the sample_table data.frame as the coldata and an experimental design.

In this workshop we are looking to investigate the differences between HER2 negative and positive status, this means our experimental design is simply ~ HER2 status as this is the only variable we are assessing across the four samples. If other factors needed to be corrected for, these can be added in front of the experimental design. The experimental design in DESeq2 is extremely complicated and is not covered in this workshop, more information can be found in the reference manual.

counts_matrix <- as.matrix(round(count_table))
dds <- DESeqDataSetFromMatrix(countData = counts_matrix, colData = sample_table, design = ~ HER2_status)

Performing differential expression

Finally before calculating the differentially expressed genes, we filter out lowly expressed genes (with a total of less than 10 counts across all samples) which are unlikely to be informative as their expression is so low. Once complete we can run the differential expression.

The exact process and steps that DESeq2 uses to identify differentially expressed genes is not discussed in this workshop. Detailed information on these steps is available in the DESeq2 paper and in the reference manual, both of these should be easy to folow.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

res <- DESeq(dds)

We next extract the results form the DESeq object and only select genes based on having an adjusted P value less than 0.1 as well as ordering based on this value. The head() of this table is shown below.

results <- results(res)
resOrdered <- results[order(results$pvalue),]

results_table <- data.frame(results)
results_table <- results_table[!is.na(results_table$padj),]
significant_genes <- results_table[results_table$padj<0.05,]
significant_genes <- significant_genes[order(significant_genes$padj),]
head(significant_genes)
baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000186081 2592.7773 -11.759742 1.2434445 -9.457392 0 0
ENSG00000171346 1220.3208 -7.319406 0.7890561 -9.276154 0 0
ENSG00000099953 1162.1623 6.455407 0.7191874 8.975973 0 0
ENSG00000104332 5294.7574 -6.937969 0.8292936 -8.366119 0 0
ENSG00000060718 825.5602 7.540317 0.9619132 7.838874 0 0
ENSG00000196878 591.0853 -5.223538 0.7282132 -7.173089 0 0

Column discriptions

  • row names - Ensembl gene ID
  • baseMean - Mean of normalized counts of all samples
  • Log2FoldChange - Log2 fold change
  • lfcSE - Standard error of the log2 fold change
  • stat - Wald statistic: the log2FoldChange divided by lfcSE, which is compared to a standard Normal distribution to generate a two-tailed pvalue
  • pvalue - P Value
  • padj - Adjusted P value

Analysing the results

MA-Plots

MA-plots provide a good way to visualize differentially expressed genes, in relation to fold change and gene expression.

An MA-plot is a plot of log-fold change (M-values, i.e. the log of the ratio of level counts for each gene between the two groups) against the log-average (A-values, i.e. the average level counts for each gene across the two groups). Genes with similar expression in both groups will appear around the horizontal line y=0. Points will be colored red if the adjusted p value is less than 0.1 (Points which fall out of the window are plotted as open triangles pointing either up or down)

plotMA(results, ylim=c(-10,10))

This plot shows there are differentially expressed genes with both positive and negative fold changes over wide range of expression levels.

Adding gene names

At present, all genes are identified by their ENSEMBL ID, it can be useful to add a gene’s name to aid in the interpretation of results. To achive this, we will be using the BiomaRt package.

library(biomaRt)


ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

all_genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'), mart = ensembl)
rownames(all_genes) <- all_genes$ensembl_gene_id

significant_genes$gene_name <- all_genes[rownames(significant_genes),2]

significant_genes <- significant_genes[,c(7,1,2,6)]


significant_genes[1:50,]
gene_name baseMean log2FoldChange padj
ENSG00000186081 KRT5 2592.77734 -11.759742 0.0000000
ENSG00000171346 KRT15 1220.32076 -7.319406 0.0000000
ENSG00000099953 MMP11 1162.16225 6.455407 0.0000000
ENSG00000104332 SFRP1 5294.75736 -6.937969 0.0000000
ENSG00000060718 COL11A1 825.56015 7.540317 0.0000000
ENSG00000196878 LAMB3 591.08526 -5.223538 0.0000000
ENSG00000185479 KRT6B 543.89650 -12.498816 0.0000000
ENSG00000256612 CYP2B7P 465.80753 7.642502 0.0000000
ENSG00000203727 SAMD5 279.58586 -5.452999 0.0000001
ENSG00000184613 NELL2 239.59809 5.329969 0.0000001
ENSG00000189377 CXCL17 653.05675 7.372097 0.0000003
ENSG00000091513 TF 163.84552 -7.038691 0.0000004
ENSG00000086548 CEACAM6 1592.73622 8.878492 0.0000005
ENSG00000100146 SOX10 158.56189 -10.721387 0.0000005
ENSG00000164403 SHROOM1 707.58383 6.078958 0.0000006
ENSG00000086570 FAT2 114.34027 -6.493976 0.0000006
ENSG00000135346 CGA 6659.36975 -29.204369 0.0000011
ENSG00000124205 EDN3 120.21826 -10.320539 0.0000029
ENSG00000081277 PKP1 832.20733 -7.251715 0.0000031
ENSG00000123500 COL10A1 374.67632 5.573005 0.0000049
ENSG00000108821 COL1A1 17231.53268 4.341870 0.0000049
ENSG00000170153 RNF150 180.42729 -5.352347 0.0000080
ENSG00000196557 CACNA1H 107.72297 5.030712 0.0000135
ENSG00000265190 ANXA8 97.06600 -10.011677 0.0000157
ENSG00000087494 PTHLH 697.97562 -7.225253 0.0000177
ENSG00000126778 SIX1 348.57785 4.888767 0.0000184
ENSG00000130635 COL5A1 1275.57345 3.691023 0.0000247
ENSG00000170465 KRT6C 131.17458 -10.446390 0.0000247
ENSG00000181143 MUC16 554.61841 -9.477250 0.0000247
ENSG00000105523 FAM83E 154.78336 4.616967 0.0000247
ENSG00000007038 PRSS21 133.04753 -7.445537 0.0000383
ENSG00000105664 COMP 334.46710 5.816290 0.0000488
ENSG00000171243 SOSTDC1 55.70537 -9.211363 0.0000521
ENSG00000182253 SYNM 470.87478 -4.565264 0.0000887
ENSG00000161634 DCD 433.73612 24.908585 0.0000954
ENSG00000198944 SOWAHA 85.82044 6.245748 0.0001012
ENSG00000283796 MIR6510 50.14004 -9.058845 0.0001128
ENSG00000170373 CST1 284.93104 9.309818 0.0001733
ENSG00000149295 DRD2 55.46971 -9.203865 0.0001983
ENSG00000172061 LRRC15 404.11505 5.413946 0.0002276
ENSG00000146281 PM20D2 135.54171 -4.616931 0.0002276
ENSG00000284180 MIR3606 85.78083 4.797578 0.0002345
ENSG00000058085 LAMC2 260.98492 -4.462525 0.0002530
ENSG00000257341 AL928654.3 362.96069 4.361005 0.0002720
ENSG00000134757 DSG3 49.23782 -9.035264 0.0002892
ENSG00000115112 TFCP2L1 458.42988 -5.087519 0.0003257
ENSG00000147255 IGSF1 73.68321 -9.616403 0.0003257
ENSG00000105852 PON3 94.92560 -6.955412 0.0003510
ENSG00000186474 KLK12 56.98827 9.333865 0.0004062
ENSG00000133048 CHI3L1 270.86949 -3.640278 0.0004149

Volcano Plot

A volcano plot is scatter plot which visualizes the fold change and signficance of the differentially expressed genes identified.

library(ggplot2)
plotting_table <- results_table[,c(2,6)]


# Set colour based on adjusted P value
plotting_table$colour <- "black"
plotting_table$colour[plotting_table$padj < 0.05] <- "blue"
plotting_table$colour[plotting_table$padj < 0.01] <- "red"

# talke log 10 values of p value
plotting_table$padj <- lapply(plotting_table$padj,function(x) -log10(x))

# convert column types 
plotting_table$colour <- factor(plotting_table$colour)
plotting_table$padj <- as.numeric(plotting_table$padj)

#plot volcano plot using ggplot
ggplot(plotting_table, aes(x=log2FoldChange, y=padj,colour=colour)) + 
  geom_point() + theme_bw() + 
  scale_color_manual(breaks = c("black", "blue", "red"),values=c("black", "blue", "red"),name="",
                     labels=c("Non-significant", "Adjusted P Value < 0.05", "Adjusted P Value < 0.01")) +
   xlab("Log2(FoldChange)") + ylab("-log10(Adjusted P Value)") +  ggtitle("Volcano plot of differentially expressed genes")

The volcano plot shows there are a large group of statistically significant genes with both positive and negative fold changes.

Transforming the data

For differential expression DESeq2 requires the raw expression data of genes. For downstream analysis however, it is advisable to transform the data, this will aid such tasks as visualization and clustering. The most comon transformation applied to data is to log transform it, DESeq2 has some more sophisticated transformations built in with the most widly used being the Variance Stabilizing Transformation (VST). This transformation removes the dependence of the variance on the mean and can be easily applied to data with the command below, more information on this transformation can be found here.

vsd <- vst(dds, blind=FALSE)

Principal Component Analysis

Principal Components Analysis (PCA) is a technique that finds underlying variables (known as principal components) that best differentiate your data points. This can be useful to determine how similar your samples are to each other.

plotPCA(vsd, intgroup=c("HER2_status"))

Ideally when looking at PCA plots samples from the same group should cluster together. This is unlikely to be seen in our data due to the small sample size.

Producing a heatmap to visualize gene expression

A useful way to visualize the gene expression between the samples investigated is to produce a heatmap. In this example we will group samples based on their HER2 status and plot the top 20 differentially expressed genes. We will cluster genes using the Ward method to group genes with similar expression across across the samples together.

library("pheatmap")

significant_genes_names <- row.names(significant_genes)
ordered <- sample_table[order(sample_table$HER2_status),]

# Extract sample information
col_bar <- as.data.frame(colData(dds)[,"HER2_status"])
rownames(col_bar) <- rownames(colData(dds))
colnames(col_bar) <- "HER2_status"

# Extract top 20 DE genes 
heatmap_data <- assay(vsd)[significant_genes_names[1:20],as.character(ordered$sample)]

#Add gene names
gene_names_heatmap <- all_genes[all_genes$ensembl_gene_id %in% rownames(heatmap_data),]
gene_names_heatmap <- gene_names_heatmap[match(rownames(heatmap_data),rownames(gene_names_heatmap)),]
rownames(heatmap_data) <- gene_names_heatmap$external_gene_name

# Plot heatmap
pheatmap(heatmap_data, cluster_rows=TRUE,clustering_method="ward.D2", show_rownames=TRUE,
         cluster_cols=FALSE, annotation_col=col_bar)

The VST data has been used in this heatmap giving a range of expression between 1 and 14. Genes with higher levels of expression are shown in red while genes with lower expression are shown in blue. The sample bar at the top gives information on the HER2 status of each of the 4 samples, with positive and negative samples grouped grouped together respectively. The clustering shows that there are groups of genes which are up regulated in on of the groups but down regulated in the other.


Pseudo-alignment

We will now complete the same differential gene expression analysis on the count data produced using Salmon. The only difference between the two methods is the steps taken to read the data into R. Salmon data can be read into R using the Tximport package. During this process, a tx2gene file is used. This file contains the gene IDs relating to the transcript IDs. A warning that approximately 7,600 transcripts are missing from the file will appear, this is because we are not quantifying transcripts which are non-coding to be consistant with the genes obtained via HISAT2.

library(tximport)
library(readr)
library(rjson)

# Reading Salmon data in

files <- file.path("~/RNA_Seq_workshop/resources/salmon" ,sample_table$sample,"quant.sf")

tx2gene <- read.table("~/RNA_Seq_workshop/resources/tx2gene/tx2gene.txt",sep = "\t",header = T)
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

# Setting up DESeq2
dds_Txi <- DESeqDataSetFromTximport(txi,sample_table, design = ~ HER2_status)
dds_salmon <- DESeq(dds_Txi)


keep <- rowSums(counts(dds_salmon)) >= 10
dds_salmon <- dds_salmon[keep,]

# Running DESeq2
res_salmon <- DESeq(dds_salmon)


# Soritng DE results

results_salmon <- results(res_salmon)
resOrdered_salmon <- results_salmon[order(results_salmon$pvalue),]

results_table_salmon <- data.frame(resOrdered_salmon)
results_table_salmon <- results_table_salmon[!is.na(results_table_salmon$padj),]
significant_genes_salmon <- results_table_salmon[results_table_salmon$padj<0.05,]
significant_genes_salmon <- significant_genes_salmon[order(significant_genes_salmon$padj),]

Comparison between alignemnt methods

Now that we have two lists of differentially expressed, we can compare with the aid of a Venn diagram

library(VennDiagram)

salmon_genes <- table(rownames(significant_genes_salmon) %in% rownames(significant_genes))
hisat2_genes <- table(rownames(significant_genes)  %in% rownames(significant_genes_salmon))


 grid.newpage() #needed to create a new plot
 
Diagram <- draw.pairwise.venn((salmon_genes[[1]] + salmon_genes[[2]]), (hisat2_genes[[1]] + hisat2_genes[[2]]), salmon_genes[[2]], category = c("Salmon", "HISAT2"), lty = rep("blank", 
    2), fill = c("light blue", "pink"), alpha = rep(0.5, 2), cat.pos = c(0, 
    0), cat.dist = rep(0.025, 2), scaled = TRUE)




Workshop summary



In this workshop we have completed a workflow to analyse RNA sequencing data from the raw .fastq files to obtaining a list of differentially expressed genes. These differentially expressed genes could then be explored further with the aid of network and pathway analysis tools. Ordinarily you would expect to find a larger number of differentially expressed genes when looking between HER2 positive and negative samples. The reduction in genes identified is due in part to the limited sample size as well as the fact that only a small subsection of reads were taken from each sample.

This workflow is one of many which could be used to produce the same output. For each stage there are several tools which could be used so it imporant to look around and select the most approrate tool for your data and the computational resources you have availible.




Extention tasks

If you have time try to complete the following tasks by yourself, the code to complete these tasks is not supplied:

  1. Produce the same plots for the Salmon data as those produced for the HISAT2 data]
  2. Create three tables looking at the differences in differentially expressed (DE) genes:
    • Table 1: Only genes differentially expressed in the Salmon data
    • Table 2: Only genes differentially expressed in the HISAT2 data
    • Table 3: Only genes differentially expressed in both datasets




Appendix


The code contained in this section is for reference only and is for the parts of the pipeline you did not run yourself. DO NOT RUN THIS CODE

Creating MultiQC report

# Loading MultiQC

module load general/miniconda/4.3.21
source activate /users/software/multiqc-env

multiqc $output_dir_master/fastqc/initial \
        -o $output_dir_master/fastqc/initial \
        -n mulitqc_initial_fastqc_report
        
source  deactivate
  • $output_dir_master/fastqc/initial - Path to be searched by MultiQC for report files

  • -o $output_dir_master/fastqc/initial - The path to where the MultiQC report should be outputted to

  • -n mulitqc_initial_fastqc_report - The file name for the report produced


The single .html file produced by multiQC can be used to provide a useful overview for all samples. You can view this report using the command firefox $output_dir_master/fastqc/initial/mulitqc_initial_fastqc_report.html


Conventional Alignment

Creating genome index

# EXAMPLE CODE ONLY - DO NOT RUN

$histat2_build $genome_fa \
               $output_path_and_name
  • $histat2_build - Path to HISAT2 build wrapper script

  • $genome_fa - Path to the genome reference .fa file

  • $output_path_and_name - Path to where the genome index will be stored, including the basename for the files (the folder must be created before HISAT2 is run)

Pseudo-alignment

Creating a transcriptome index

# EXAMPLE CODE ONLY - DO NOT RUN

salmon index -t $transcriptome_reference \
    -i $path_to_index_folder \
    -k 31 \
  • index - Sets salmon to index generation mode

  • -t $transcriptome_reference - Path to the reference transcriptome .fa file

  • -i $path_to_index_folder - Path to where transcriptome index should be stored (folder must be created first)

  • -k 31 - Sets the k-mer length 31 - this number is recommended by the Salmon developers for reads above 75bp (more infomation on this can be found here)


Quantifying transcripts

# EXAMPLE CODE ONLY - DO NOT RUN

cd $trim_output
for dir in */ ; do # loop around sample directories
        # Make output dir
        outputdir=$output_dir_master/salmon/$dir
        mkdir -p $outputdir

        cd "$trim_output/$dir"
        forward="$trim_output/$dir"$(ls | grep 1_trimmed_paired.fq)
        reverse="$trim_output/$dir"$(ls | grep 2_trimmed_paired.fq)
        echo $outputdir
        # Salmon comand - exlained in detail in section below
        $salmon quant \
            -i $resources/transcriptome/salmon_index \
            -l A \
            -1 $forward   \
            -2 $reverse \
            -o $outputdir \
            -p 1 \
            --gcBias \
            --incompatPrior 0
        
done
  • quant - Sets salmon to quantification mode

  • -i $path_to_index_folder - Path to the folder containing the previously generated index

  • -l A - Specifying the library type - In this case we have set this to A which allows Salmon to automatically infer this (more details can be found here)

  • -1 $forward - Path to the forward trimmed read file

  • -2 $reverse - Path to the reverse trimmed read file

  • -o $outputdir - Path to where the output should be stored

  • -p 1 - Sets the number of threads to be used (in this case 1), this number has to be the number of available cores on the server node

  • --gcBias - Enable Salmon to learn and correct for fragment-level GC biases in the input data (more details can be found here)

  • --incompatPrior 0 - Sets Salmon to only consider mappings (or alignments) that are compatible with the prescribed or inferred library type (more details can be found here)